Abstract. The growing amount of information in the world has increased the need for
computerized classification of different objects. This situation is also present in higher
education, where effortless detection of similarity between different study courses would make
it possible to organize student exchange programmes effectively and would facilitate
curriculum management and development. This area, which currently relies on manual,
time-consuming expert activities, could benefit from the application of suitably adapted machine
learning technologies. Data in this problem domain is complex, so automatic classification
approaches cannot always reach the desired classification accuracy.
Therefore, our approach suggests an automated/semi-automated classification solution, which
incorporates both machine learning facilities and interactive involvement of a domain expert
for improving classification results. A prototype of the system has been implemented and
experiments have been carried out. The interactive classification system makes it possible to
classify educational data, which often comes in unstructured or semi-structured, incomplete
and/or insufficient form, and significantly reduces the number of misclassified instances in
comparison with a fully automatic machine learning approach.

Keywords: machine learning, interactive classification, inductive learning, curricula comparison. 
1. Introduction


The growing amount of available information in the world encourages the use of
automatic data processing techniques that reduce routine human work. This is the
place for artificial intelligence and its subfield, machine learning. Education is one of
the areas where extensive data exploration is needed. This research focuses on
study course compatibility analysis in higher education. The comparison of study
programmes and courses is necessary in several educational tasks. One of them is student
mobility. Given the number of different educational institutions operating
inside the global knowledge provision space, this is a time-consuming task.
Although one of the main features of the Bologna process is to encourage creation of
a common model for Higher Education in Europe (Kennedy et al., 2009), there is still
no generally established standard for describing study courses in all universities,
and course descriptions currently appear both as semi-structured and unstructured
texts. This is the main obstacle to comparing courses automatically.
Therefore, in practice the comparison of study programmes and individual courses is
performed manually.

Application domains are getting more complex in terms of data amount, representation
forms, relationships within data, etc. For this reason machine learning approaches
face new challenges in solving tasks which could benefit from automated
solutions but do not conform to typical machine learning application areas.
Classification is one of the machine learning tasks, in which the program learns to
predict the class label of new instances from facts provided by a human or the
environment. The classification process can be divided into classifier building (or
training), testing and applying steps. From the whole range of classification approaches
we consider inductive learning algorithms in the form of decision trees and rules. They
are widely used in machine learning tasks and hold a strong position as reliable
classification methods that can explain how a decision is made (Aksoy, 2008). In
computer science, inductive learning is learning by example, where a system tries to
induce a concept description $c: X \to L$ from a set of observed instances
$X = \{x_1, \ldots, x_i\}$ with a known set of class labels $L = \{l_1, \ldots, l_j\}$.
Each instance $x$ consists of attribute-value pairs
$\{(a_1, v_{a_1}), \ldots, (a_n, v_{a_n})\}$.
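The notation above can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names, attribute values, class labels and the two rules are hypothetical and chosen only for illustration. An instance is held as a mapping of attribute-value pairs, and an induced concept description is approximated by an ordered list of rules.

```python
from typing import Dict, List, Optional, Tuple

# An instance x is a set of attribute-value pairs (names and values are hypothetical).
Instance = Dict[str, str]

# A rule is a conjunction of attribute tests plus the class label it predicts.
Rule = Tuple[Dict[str, str], str]

# Hypothetical rule base standing in for an induced concept description c: X -> L.
RULES: List[Rule] = [
    ({"level": "postgraduate", "topic": "knowledge management"}, "KM"),
    ({"topic": "databases"}, "DB"),
]


def classify(x: Instance, rules: List[Rule]) -> Optional[str]:
    """Return the label of the first rule whose tests all hold for x, else None."""
    for conditions, label in rules:
        if all(x.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule covers x
```

Returning `None` when no rule fires keeps the case "no rule covers the instance" visible instead of hiding it behind a forced prediction.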

This work can be characterized as applied, experimental and quantitative research.
It is aimed at developing an automated or semi-automatic classification solution which
incorporates both machine learning facilities and interactive involvement of a domain
expert in the classifier applying stage, so that results can be improved when the
classifier makes an uncertain classification. The rest of the paper is organized as
follows. Research objectives and related work from both the educational and the machine
learning perspective are given first, followed by an interpretation of the study course
comparison task in a machine learning context. We then describe the contents of the
developed Interactive Classification System (InClaS) framework. An instance of such a
system is built and used to carry out study course comparison empirically. Experimental
settings, achieved results and conclusions complete the paper.
2. Research Objectives and Related Work


The motivation for the research and development of an interactive inductive learning based
classification system comes from several sides. One of them is that fully automated
classification methods are not appropriate for all domains where machine learning techniques
could be applied. Another driver for developing an interactive classification system is the
practical need in the area of curricula comparison. We will discuss both of these issues briefly.

Nowadays information is often organized in forms that are complicated for machine learning,
such as plain (unstructured) text, graphs, semi-structured text, etc. A transformation
from the original data to classifier-acceptable data structures is needed, and in this
process some information can get lost or be mapped inaccurately. This leads to an
incomplete classifier that does not generalize the problem domain well and probably
will not be able to make predictions for all new unseen instances when the classifier is
applied. We state that the solution to this problem is a semi-automatic classification
system that gives the expert wider control over the classification process and
uses his/her knowledge to improve it gradually.

In the context of educational document comparison, the term educational document is used
to denote different types of materials for educational content and assessment, including
course descriptions, teaching materials, academic credentials, etc. The need to compare
educational documents appears in different forms and can be conditionally divided
into three categories (Alves and Figueira, 2011; Anohina-Naumeca et al., 2012;
Biletskiy et al., 2009; Biletska et al., 2010; Ranganthan et al., 2006; Rudzajs and Kirikova,
2012, 2009; Dagiene et al., 2013; Teodosiev and Nachev, 2012). These categories are:

(1) Student exchange programmes. 

(2) New curriculum development. 

(3) Teaching material and learning object categorization for, e.g. e-learning systems. 

In the scope of this paper we consider only the first category and target mutual
comparison of course content.

Examples of study course textual descriptions are given in Fig. 1 to demonstrate their 
variety. 

Although there are attempts to put it this way, study course comparison does not
fully belong to the problem of text classification. A study course description is most often
a semi-structured text which usually includes sections like “prior knowledge”, “learning
outcomes”, etc. It is important to distinguish between these sections. Besides, a
semi-structured text has a significantly richer and more complicated structure than plain
text, and the relations among semi-structured documents are harder to utilize fully if
only text categorization is used (Sebastiani, 2002; Jianwu Yang and Chen, 2002; Midler,
2010). A study course comparison approach which uses formalized, semantically meaningful
attributes and involves interaction with an expert could help here.

Code: DSP701
Course title: Knowledge Management Systems
Course status in the programme: Compulsory / Courses of Limited Choice
Course level: Post-graduate Studies
Course type: Academic
Field of study: Computer Science
Responsible instructor: Kirikova Marite
1 part. 4.0 Credit Points. 6.0 ECTS credits
Language of instruction: LV, EN

Abstract

In this course students will learn about the concepts of organisational learning and knowledge,
essential factors of organisational learning, knowledge flow and networks and technologies
supporting them. Human-computer interaction and interface design will be discussed. Students
will learn to define knowledge management strategy, to design knowledge management systems,
to plan the development of these systems and will be familiar with different knowledge
management technologies.

Goals and objectives of the course in terms of competences and skills

Successful completion of this course will provide students with the content and skills necessary to:
explain the impact of the nature of knowledge on the management of knowledge; understand and
interpret the concept and objectives of knowledge management in terms of advanced business
practices and technologies; analyse knowledge processes within an organisation in terms of
organisational performance and development; identify approaches (tools and techniques) that
organisations may take to make a contribution to organisation’s knowledge processes; understand
the need for equal consideration of technological, human and organisational aspects; identify and
define the best approach of knowledge.

Structure and tasks of independent studies

In individual assignments students will explore and analyse knowledge management solutions


MBI 665 Knowledge Management and Decision Support

This course introduces students to knowledge management
practices and the technologies collectively called decision support
systems. To cover the most current topics affecting how individuals
and organizations use computerized support in making decisions.
Business applications of data warehouses, online analytical
processing, group support systems, knowledge acquisition and
representation, knowledge management, knowledge-based
decision support and intelligent systems will be explored.
PREREQ: MBI 625


Fig. 1. Study course description in Riga Technical University (http://www.rtu.lv/, 2013) (top)
and Northern Kentucky University (http://www.nku.edu/, 2012) (bottom).


Existing research in the area of curricula comparison does not solve the problem
of study course comparison. It has been proposed to represent study programmes as
concept maps, and a system based on schema matching (Saleem et al., 2008) of concept
maps has been developed (Anohina-Naumeca et al., 2012). In this approach curricula
are compared according to their structure. However, one of the basic tasks in comparing
curricula is the comparison of individual courses at the level of course content, which
is not covered by that research. The unsupervised classification mechanism presented by
Alves and Figueira (2011) organizes educational documents from an e-learning system into
clusters. Ranganthan et al. (2006) describe a methodology for the classification of
learning objects which can appear in different forms, e.g. course outlines and transcripts
without well-defined metadata. A new learning object is classified by finding
the smallest distance to a cluster, where clusters define subdomains of interest.
However, neither Alves and Figueira (2011) nor Ranganthan et al. (2006) use the course
description as the input. Both of these approaches also assume that objects
relate to only one category, although in practice this is not the case when comparing
documents from distinct curricula.

As claimed in (Coletta et al., 2011), comparative analysis of educational documents
is a complicated task both for experts and for computer systems. Therefore automation of
this process requires specific approaches and expert participation. Semi-structured
document representation requires the use of various information extraction methods. The
authors of the Academic e-Advising system (Biletskiy et al., 2009) point out that the
system’s results could undoubtedly be improved by expanding the size of the training corpora
and involving an expert. The system would also benefit from an easy
mechanism for manual inspection and augmentation of the extracted data to improve
data quality for further use.

Analysis of related work shows that educational document comparison requires an
automated but not fully automatic approach to obtain reliable results. It is also worth noting
that, despite the fact that different study programmes do not have the same granularity and
content distribution between courses (Biletska et al., 2010; Ranganthan et al., 2006),
course similarity has been considered only using one-to-one correspondence. None of
the presented systems so far deals with possible one-to-many correspondences
between courses or uses a multi-label classification approach. Although the need for expert
involvement has been emphasized, the methods used so far do not foster collaboration with
an expert.

A multi-label class membership requires the use of appropriate and more sophisticated
classification methods. Multi-label classification is useful in practice when an object
naturally belongs to more than one category (Thabtah et al., 2005). In multi-label
classification, examples are associated with a set of labels $Y \subseteq L$, where $L$ is a
set of labels with $|L| > 1$, in contrast with traditional single-label classification,
where examples are associated with a single label $l$ from $L$.

As the classification task in the course comparison context is complicated by an
insufficient amount of training examples and possibly incomplete formalized study course
descriptions, the automatic classifier may not be able to make a sufficiently informed
decision on its own. It may happen that none of the classification rules fits the new
instance when the classifier is applied. There are several methods to deal with this problem.
Inductive learning systems with a low number of unclassified instances usually apply a
default rule for classifying new instances that none of the rules in the rule base can
classify (Clark and Niblett, 1989). The default rule comes from the CN2 (Clark and Niblett,
1989) and AQ (Michalski et al., 1986) algorithms and predicts the most common class in a
particular data set. If a data set contains many classes and, moreover, if all of them occur
equally frequently, assigning one particular class to all unclassified instances will not lead
to a high accuracy of the classifier. Moreover, most current classification algorithms do not
admit their inability to classify an instance but classify it anyway (correctly or incorrectly),
making it harder for the system’s user to detect the boundary of the “real knowledge” of the
classifier. Therefore, an interactive semi-automatic approach which takes into account the
confidence with which the classifier makes its decision should be used.
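The weakness of the default rule on balanced data can be shown with a small calculation. The sketch below is illustrative only; the function name and both label lists are invented, and it simply measures the accuracy obtained when every instance receives the most common class.

```python
from collections import Counter


def default_rule_accuracy(labels):
    """Accuracy obtained when every instance is assigned the most common class."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical label distributions: one dominated by a single class, one balanced.
skewed = ["A"] * 8 + ["B", "C"]
uniform = ["A", "B", "C", "D", "E"] * 2

print(default_rule_accuracy(skewed))   # 0.8
print(default_rule_accuracy(uniform))  # 0.2
```

With five equally frequent classes the default rule is right for only one instance in five, which is why it is a poor fallback in a domain with many equally frequent course labels.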

The authors of this paper have analysed and summarized a number of papers (Okabe
and Yamada, 2002; Tanumara et al., 2007; Buntine and Stirling, 1991; Hadjimichael
and Wasilevska, 1993; Wong and Laung, 2000; Li et al., 2009) referring to the
concept of interactive inductive learning and exploring the idea of user interaction in
a concept learning process. Depending on the phase of the classification process in which
human interaction is expected, a diagram giving an abstract view of the different
existing approaches to interactive inductive learning has been created (see Fig. 2).
In the stage of classifier building, data is passed to the learning algorithm (phase A)
and classification rules are produced as output (phase B). In the stage of classifier
applying, a new instance (or instances) with no classification is provided to the classifier
(phase C) and a decision on its class is expected (phase D). The methods
described in the aforementioned works provide interaction with an expert in phases A, B or
D, which is either too early or too late to handle new instances that the classifier cannot
classify, but not in phase C, when a particularly hard-to-classify instance arrives.
Special methods of interactive classification - active learning (Settles, 2010) and Ripple
Down Rules (Brian and Compton, 1995) - have also been considered. In our research
we are dealing with phase C.

Fig. 2. Phases when an expert can interact with the classifier.

Based on the related work presented above, the target of the research is defined as the
development of an inductive learning based interactive multi-label classification system for
supporting study course comparison.


3. Interpretation of the Course Comparison in Machine Learning Context 

According to the related work, the problem domain - comparative analysis of university study
courses - can be characterized by the following features, which the intended machine learning
solution should take into account.

• Understanding decision making steps is important for the classifier’s user and
the expert.

• Available initial learning base (in this case - expert-made course comparisons)
is small.

• Initial data (textual course descriptions) is semi-structured or unstructured.

• Domain defines many classes (course labels) with equal frequency.

• Each object (study course) can have a multi-label class membership (correspondence).

This section clarifies study course comparison as a classification task. To do
so, we need to define attributes and classes. A study course description does not naturally
possess well-defined attributes. To apply inductive learning or other classification
methods, a formalized attribute-value based representation has to be obtained. For the
practical implementation of university course comparison two main settings are chosen - direct
and indirect comparison. For direct comparison, a text classification approach is applied
which makes use of word vectors obtained from full course descriptions. Indirect
comparison involves a mediating framework for extracting semantically meaningful
information from course descriptions. Meaningful and usually accessible course attributes are
learning outcomes, study level and the number of credit points. Learning outcomes can
be described in different ways; hence, a need for unification arises. The European
e-Competence Framework (e-CF) (European E-Competence Framework, 2012) is chosen as the
mediating framework because it is a Europe-wide framework and is oriented to learning
outcomes, which are important for course comparison.
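For the direct comparison setting, the step from full course descriptions to word vectors can be sketched as follows. This is a deliberately simplified bag-of-words illustration, not the preprocessing used by the authors: the tokenizer, the raw term counts and the function names are assumptions, and the two sample texts are shortened paraphrases of the course titles in Fig. 1.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_vectors(descriptions):
    """Map each description to a vector of word counts over a shared vocabulary."""
    vocabulary = sorted({word for d in descriptions for word in tokenize(d)})
    vectors = []
    for d in descriptions:
        counts = Counter(tokenize(d))
        # Counter returns 0 for words that do not occur in this description.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


descriptions = [
    "Knowledge management systems",
    "Decision support systems and knowledge",
]
vocabulary, vectors = word_vectors(descriptions)
```

Each position of a vector then acts as one attribute of the formalized instance, so the resulting data can be passed to any attribute-value based learning algorithm.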

For the training set, an expert assigns classes to study courses that are not yet known to
the system (i.e. detects correspondences). Note that the expert can assign more than one class
since the courses can overlap in their content.

Fig. 3 gives an overview of how formalized course attributes and detected classes are
obtained in direct and indirect comparison. Formalization is done in order to
prepare input data in a format appropriate for classification algorithms.

It is worth noting that the attribute selection in this task is not predefined. Data sets
extracted in direct and indirect comparison are used separately; therefore, practical
experiments can demonstrate the classifier’s ability to generalize from provided attributes
in both representations and provide a justification for preferring one or another.
Fig. 3. Approach for formalizing study course comparison.
4. Framework of Interactive Inductive Learning Based Classification System
(INCLAS)


The proposed framework is developed to define how to create an interactive classification
system for a particular implementation. Various components extending a traditional
classification system have been designed (Birzniece, 2010; Birzniece and Kirikova, 2011;
Birzniece and Rudzajs, 2011). These components are extended and amalgamated in
the framework of the Interactive Inductive Learning Based Classification System (InClaS).
Fig. 4 depicts the three levels of this framework, which are briefly explained in the
remainder of this section.

4.1. Generic Model 

InClaS generic model consists of the components as follows (see also InClaS generic 
model in Fig. 4): 

• General scheme of interactivity. 

• Definition regarding an uncertain classification. 

• Interactive classification system’s structure, its modules and connections. 

• Suggested approaches for updating the classifier. 

Parameters which have to be determined in each InClaS application area are also identified
(see Fig. 4). The choice of a learning algorithm and its settings, as well
as of descriptive attributes, has to be made in all classification tasks, and the interactive
approach makes no difference in this respect.



Fig. 4. Framework of InClaS. 
4.1.1. General Scheme of Interactivity to be Implemented

Fig. 5 shows how the interactivity is implemented into the general model of the
classification process. Blocks with solid line are elements of a traditional automatic
classification system. Blocks and arrows with interrupted lines are introduced to ensure
interactivity with a human expert in order to assign class value(s) for uncertainly classified
instances. This includes the following functions:

1. Capturing uncertain classifications in the classifier applying stage. 

2. Forwarding these instances and additional information to the expert. 

3. Receiving and processing the expert’s decision. 

4. Using expert-provided knowledge to update the classifier. 

The questions which arise from the classification system’s extension with interactivity
are resolved within the next subsections that concern other components of InClaS
generic model.

4.1.2. Definition of an Uncertain Classification 

To answer the question “What to ask the expert?”, it is important to define the
characteristics of instances which are uncertain to the classifier and could benefit from the
expert’s perusal. Therefore, terms that are used in varying ways in the machine learning
literature - unclassified instance, instance with low classification confidence and uncertain
classification - are clarified and their meanings in the context of this research are defined
for further use.

Unclassified instance is an instance which was not covered by any rule (or a corresponding
leaf in the decision tree) from the classifier’s model in the classifier applying stage.

Taking into consideration the confidence which the classifier associates with the rule 
(or leaf) that is used to classify an instance, the classifier’s decision can be marked as not 
confident enough. Confidence is based on example distribution in the training set which 
was used to build the classifier. An instance is said to be classified with low confidence if 
the confidence level for the class assigned by the classifier is below the selected threshold. 

Uncertain classification includes both of the above-mentioned aspects and is a term used
to denote instances that are either unclassified or classified with low confidence.
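The three definitions above can be combined into a single check. The following is a minimal sketch under stated assumptions: the function name, its arguments and the default threshold of 0.5 are illustrative and not part of any library, and `None` is used to represent the case in which no rule or leaf covered the instance.

```python
from typing import Optional


def is_uncertain(predicted_label: Optional[str],
                 confidence: float,
                 threshold: float = 0.5) -> bool:
    """Flag a classification as uncertain.

    predicted_label is None when no rule (or leaf) covered the instance;
    confidence is the value the classifier associates with the assigned class.
    """
    unclassified = predicted_label is None
    low_confidence = confidence < threshold
    return unclassified or low_confidence
```

Only instances for which this check returns `True` would be forwarded to the expert; all other classifications are returned to the user directly.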
Fig. 5. Framework of InClaS.
For multi-label classification a more sophisticated definition of uncertain classification
has to be applied, since more than one class can be assigned to an instance. This
aspect, as well as the method for determining the most appropriate confidence levels for
different data sets, is outlined in the next section along with the particularization of
InClaS for multi-label classification tasks.

4.1.3. Interactive Classification System’s Structure

The system’s structure holds part of the answer to the question “How to interact between
the system and the expert?”. A modular structure is chosen for the interactive
classification system. Fig. 6 shows actions typically performed in the interactive
classification system, without the inner process details within modules. The user can provide
data for classifier training (1a), initiate classifier building (2a) and submit new instances to
be classified (3a). If the classification can be made by rules in the Classifier, the user
receives classification results as a response (3c). If there is an instance which cannot be
classified with certainty by the Classifier applying module, a request to handle the situation
is sent to the Interactivity module (3d). The Interactivity module asks for an expert
classification of the instance through the interface (3e); this is the situation when a
request for a response is sent from the system to the user, not vice versa. After receiving the
expert’s feedback, the Interactivity module informs the user and updates the Example
base with a new example that is built from the instance and the user-given classification
of it (3g). Consequently, the Classifier can be updated. Techniques which an expert
can use for decision making regarding instance classification are not considered in the
scope of this work. The classification system accepts a single expert opinion.
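The flow of steps (3a) to (3g) can be sketched as a single function. This is a hedged illustration rather than the prototype's code: `classifier` and `ask_expert` are hypothetical callables standing in for the Classifier applying module and the Interface module, the example base is a plain list, and the threshold of 0.5 is an assumed default.

```python
def classify_interactively(instance, classifier, ask_expert, example_base,
                           threshold=0.5):
    """Classify one instance, falling back to the expert when uncertain.

    classifier(instance) is assumed to return (label_or_None, confidence);
    ask_expert(instance) is assumed to return the expert's label.
    """
    label, confidence = classifier(instance)         # instance submitted (3a)
    if label is not None and confidence >= threshold:
        return label                                 # result returned to the user (3c)
    expert_label = ask_expert(instance)              # request to the expert (3d), (3e)
    example_base.append((instance, expert_label))    # example base updated (3g)
    return expert_label
```

The function only extends the example base; rebuilding or incrementally updating the classifier from that base is left to a separate step, in line with the modular structure described above.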

4.1.4. Suggested Approaches for Updating the Classifier 

To answer the question “How to update the classifier?” activities for accepting the
expert’s classification and updating the classifier in response to this decision are defined.


Fig. 6. Modules and main processes within the interactive classification system.
The task of the interactive system is to accept the expert’s decision and to update the
classifier in response to this decision. The main options are either to treat the instance
classified by the expert as a rule or to use it as a training example. As a result, two
approaches for incorporating expert-made decisions into the classifier, both of which maintain
the consistency of the classifier, are identified: the incremental learning approach, which
uses one of the readily available incremental learning algorithms, and the threshold based
static learning approach proposed by the authors. Both approaches are described in detail in
(Birzniece, 2010).

InClaS generic model provides general-purpose components to develop either a
single-label or a multi-label classification system. Due to the scope of problem to be solved,
InClaS is further developed to serve classification tasks with multi-label class
membership. It is described in the next section.


4.2. InClaS Model for Multi-Label Classification 

To deal with multi-label classification, the InClaS model has been extended with additional
and more specific components (see also the InClaS model for multi-label classification
in Fig. 4 and the detailed descriptions of components in (Birzniece, 2010; Birzniece
and Kirikova, 2011; Birzniece and Rudzajs, 2011)). They are as follows:

• Algorithm for detecting uncertain classification. 

• Method for determining the most appropriate confidence level. 

• Architecture of a classification system. 

4.2.1. Algorithm for Detecting Uncertain Classification 

Multi-label class membership requires an extended definition of uncertain classification
and unclassified instance, since each object can belong to an unknown number of classes,
which makes the classification task more complicated. One of the widely used approaches
to multi-label classification is binary relevance (Tsoumakas and Katakis, 2007), which
splits the initial problem into several single-label classification tasks.
Therefore, the classification of a new instance comes from a combination of n single-label
classifiers where each classifier predicts classification for just one of all n classes.
If none of the classifiers predicts the positive class, the instance is defined as unclassified
(and is thus also marked as an uncertain classification). The algorithm for detecting uncertain
classification in multi-label domains defines that an instance is uncertainly classified if at
the chosen (or default) confidence level none of the actual classes of the instance is predicted.
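A minimal sketch of binary relevance with detection of unclassified instances is given below. It rests on several assumptions: each binary classifier is represented by a hypothetical callable returning the confidence of the positive class, the function name is invented, and, since actual classes are not known when the classifier is applied, an instance is flagged as uncertain when no label at all reaches the confidence level.

```python
def predict_label_set(instance, binary_classifiers, confidence_level=0.5):
    """Combine n single-label classifiers into one multi-label prediction.

    binary_classifiers maps each label to a callable that returns the
    confidence of the positive class for the given instance.
    Returns the predicted label set and an 'uncertain' flag.
    """
    predicted = {
        label
        for label, classifier in binary_classifiers.items()
        if classifier(instance) >= confidence_level
    }
    uncertain = not predicted  # no classifier predicted the positive class
    return predicted, uncertain


# Hypothetical keyword-based stand-ins for trained binary classifiers.
classifiers = {
    "KM": lambda text: 0.8 if "knowledge" in text else 0.1,
    "DSS": lambda text: 0.6 if "decision" in text else 0.2,
}
```

Raising the confidence level makes the empty prediction, and therefore a request to the expert, more likely, which is the trade-off examined in Section 4.2.2.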

To assess the usefulness of user involvement in the classification process and its impact on
the number of misclassified instances, the authors of this paper introduce several simple
measures, which are evaluated later in the experimental phase:

• Partly or completely correctly classified instance (PC) - at least one of the
predicted classes is an actual class of the instance, $Y_i \cap Z_i \neq \emptyset$, where
$Y_i$ is the actual label set of instance $i$ and $Z_i$ is the predicted label set of
instance $i$.

• Misclassified instance (M) - none of the predicted classes is an actual class of the
instance, $Y_i \cap Z_i = \emptyset$.

• True uncertain classification (TU) - the classifier would misclassify the instance
(M) (that is, with the confidence level 0.5 none of the actual classes would be
predicted).

• False uncertain classification (FU) - the classifier would classify the instance partly
or completely correctly (PC) (that is, with the confidence level 0.5 at least one of the
actual classes would be predicted).
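The four measures can be counted with a short routine. The sketch below is one possible reading rather than the authors' evaluation code: each evaluated instance is assumed to be a pair of its actual label set and a mapping from labels to confidences, an instance is treated as uncertain when no label reaches the chosen level, and the baseline of 0.5 is used to split uncertain instances into TU and FU. All names and sample values are hypothetical.

```python
def evaluate(instances, level=0.5, baseline=0.5):
    """Count PC, M, TU and FU for a list of (actual_labels, confidences) pairs."""
    counts = {"PC": 0, "M": 0, "TU": 0, "FU": 0}
    for actual, confidences in instances:
        predicted = {l for l, c in confidences.items() if c >= level}
        if predicted:
            # A classification is given: partly/completely correct or misclassified.
            counts["PC" if actual & predicted else "M"] += 1
        else:
            # Uncertain: check what the classifier would have done at the baseline.
            at_baseline = {l for l, c in confidences.items() if c >= baseline}
            counts["FU" if actual & at_baseline else "TU"] += 1
    return counts


# Hypothetical evaluation data: (actual label set, confidence per label).
data = [
    ({"KM"}, {"KM": 0.9, "DB": 0.2}),
    ({"KM"}, {"KM": 0.3, "DB": 0.8}),
    ({"DB"}, {"KM": 0.1, "DB": 0.6}),
    ({"DB"}, {"KM": 0.6, "DB": 0.2}),
]

print(evaluate(data, level=0.7))  # {'PC': 1, 'M': 1, 'TU': 1, 'FU': 1}
```

At the level 0.7 the third and fourth instances receive no label; the third would have been correct at 0.5 (FU), while the fourth would have been misclassified (TU), so asking the expert about it removes an error.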

Certainly, it is desirable to strive for a classifier which maximizes the number of PC
instances; however, if achieving a high number of PC instances is hindered by the
incompleteness of the classifier, e.g. because of a small training set, the classification
system should at least be aware of its “lack of knowledge” and be able to detect uncertain
classifications.

4.2.2. Method for Determining the Most Appropriate Confidence Level 

It is assumed that a higher confidence level results in fewer misclassified instances, although
it increases the number of uncertain classifications (instances below this confidence
level), which in the interactive approach are passed to the expert. Therefore, a compromise
has to be found between the expert’s workload and the number of misclassified
instances left in the classification results. Different domains have different characteristics
regarding the confidence level. Both a manual and an automatic method for determining the
most appropriate confidence level for each data set have been developed to address this
issue (Birzniece, 2013). The goal of the method is to determine the confidence level at which
the number of misclassified instances (M) is minimal, taking into
consideration the given constraints regarding the expert’s workload.
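The trade-off can be made concrete with a simple grid search. This is not the method of (Birzniece, 2013), whose details are not given here; it is only an assumed baseline in which the expert's workload is expressed as the largest admissible share of instances passed to the expert, and all function and parameter names are hypothetical.

```python
def count_m_and_uncertain(instances, level):
    """Return (misclassified, uncertain) counts at the given confidence level."""
    misclassified = uncertain = 0
    for actual, confidences in instances:
        predicted = {l for l, c in confidences.items() if c >= level}
        if not predicted:
            uncertain += 1          # would be passed to the expert
        elif not (actual & predicted):
            misclassified += 1      # wrong classification left in the results
    return misclassified, uncertain


def choose_confidence_level(instances, candidate_levels, max_expert_share):
    """Pick the level with the fewest misclassifications within the workload limit.

    Returns None when no candidate level satisfies the constraint; ties are
    resolved in favour of the lowest level.
    """
    best = None
    for level in sorted(candidate_levels):
        m, u = count_m_and_uncertain(instances, level)
        if u / len(instances) > max_expert_share:
            continue
        if best is None or m < best[1]:
            best = (level, m)
    return None if best is None else best[0]


# Hypothetical validation data: (actual label set, confidence per label).
validation = [
    ({"KM"}, {"KM": 0.9, "DB": 0.2}),
    ({"KM"}, {"KM": 0.3, "DB": 0.8}),
    ({"DB"}, {"KM": 0.1, "DB": 0.6}),
    ({"DB"}, {"KM": 0.6, "DB": 0.2}),
]
```

With these data, allowing the expert to review at most half of the instances selects the level 0.7, whereas allowing no expert involvement at all leaves only the level 0.5.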

4.2.3. Architecture of an Interactive Multi-Label Classification System 

Design of an interactive inductive learning based classification system for a multi-label
classification task is guided by a five step procedure for designing intelligent systems
by (Bielawski and Lewand, 1991). Design decisions for a university study course
comparison task are explained resulting in a more detailed system’s structure which defines
particular inputs and outputs of the modules. This component of the InClaS model is
detailed in the authors’ publications (Birzniece, 2011; Birzniece and Kirikova, 2011;
Birzniece and Rudzajs, 2011).

The developed InClaS generic model and its extension for multi-label classification 
provide a sufficient theoretical and methodical ground for implementing an interactive 
classification system as a software prototype. 


4.3. InClaS Prototype

This subsection describes the main functionality of the prototype, paying attention to
how the InClaS model components are embodied in software. Data input and output are
provided through a graphical user interface (GUI). The classification system extracts the
rules held in the classifier and saves them to a text file in a human-readable form.

The prototype reuses already implemented classification algorithms and methods: basic
learning algorithms are called from the Weka software (Hall et al., 2009), and the
multi-label classification methods that make use of them are implemented in the Mulan
library (Tsoumakas et al., 2011). In the exploitation mode, the prototype currently uses
11 static learning algorithms or method-algorithm combinations from Weka and Mulan,
applying their default settings.

To implement the interactivity scheme, the classifier’s application stage has been
extended with the ability to trace the confidence of each classification and to intercept
uncertain classifications. Classification results are presented to the user (expert), who
can appraise the classes assigned with different confidences and make their own
classification if no classification is given with a confidence of 0.5 or more.
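
The interception step described above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the prototype’s code (which builds on Weka and
Mulan): per-label confidences are assumed to arrive as a dictionary, and the function name
classify_interactively and the ask_expert callback are hypothetical.

```python
CONFIDENCE_LEVEL = 0.5  # the threshold named in the text


def classify_interactively(label_confidences, ask_expert, level=CONFIDENCE_LEVEL):
    """Return (labels, source) for one instance.

    label_confidences: dict mapping a label to the classifier's confidence in [0, 1].
    ask_expert: callback that receives the confidences and returns the expert's labels.
    """
    # Accept every label whose confidence reaches the level.
    assigned = {label for label, conf in label_confidences.items() if conf >= level}
    if assigned:
        return assigned, "classifier"
    # No label reaches the level: the classification is uncertain, so it is
    # intercepted and handed to the expert together with the confidences.
    return set(ask_expert(label_confidences)), "expert"


# Toy usage with invented course names; the expert is simulated by a lambda.
confident = {"CourseA": 0.82, "CourseB": 0.55, "CourseC": 0.10}
uncertain = {"CourseA": 0.31, "CourseB": 0.12}
print(classify_interactively(confident, ask_expert=lambda c: set()))
print(classify_interactively(uncertain, ask_expert=lambda c: {"CourseB"}))
```

Passing the confidences to the callback mirrors the idea that the expert can appraise the
classes proposed with low confidence before deciding.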

To emphasize the novelty of the development, the differences and improvements in
comparison to the Weka tool and the Mulan library are summarized. From this aspect, the
main InClaS contributions are:

(1) A GUI for the Mulan library (the developers of Mulan do not provide a GUI).

(2) The ability for the system’s user to examine the classifier rule base conveniently (if
the particular learning algorithm produces rules).

(3) A GUI and the processing engine behind it for ensuring interactivity.

Altogether, the InClaS prototype provides a unique environment for multi-label
classification that is more user-friendly than was possible before, as well as novel
interactivity facilities between the classification system and its user.


5. The Application of InClaS in Education

This section describes the experimental plan and the main results of the practical
evaluation of the InClaS model and its prototype in the domain of higher education, in
particular university study course comparison. The aim of the experiments is to examine
the utility of the InClaS framework and the usability of the system’s prototype, and to
evaluate the impact of the chosen settings on the study course comparison task.

To assess the utility of InClaS, the number of misclassified instances under the standard
non-interactive approach is compared with that under the proposed interactive approach.
Regarding the usefulness of the proposed solution in the education area, the following
aspects are evaluated:

• Verification of the thesis that this problem domain is not appropriate for traditional
automatic machine learning solutions, whereas an interactive multi-label classification
system based on inductive learning methods can provide an acceptable solution for
supporting study course comparison.

• Evaluation of direct (using attributes obtained directly from full course descriptions)
and indirect (using mediated attributes from course descriptions) study course
comparison.

5.1. Experimental Plan


Experimental settings are described in Table 1. 

Four setting combinations as separate stages of experiments are defined: (1) word
vectors with automatic classification, (2) mediated attributes with automatic
classification, (3) word vectors with InClaS, and (4) mediated attributes with InClaS.
Stage 1 is preliminary to stage 3 and stage 2 precedes stage 4.

Parameters of data sets are given in Table 2. 

The full data set consists of 79 examples from different European universities providing
Business Informatics related curricula, namely, 25 instances from Riga Technical
University, 6 instances from University of Rostock, 31 from Vienna University of
Technology and 17 from University of Vienna. In the reduced set, the labels with fewer
than 4 examples are removed. Label cardinality of a data set is the average number of
labels per example in this set. Label density of a data set is the label cardinality divided
by the number of labels. Distinct labelsets is the number of different label combinations
within a data set. The word vector based data set contains 1884 attributes representing
the presence or absence of 1884 words encountered in the study course description
examples. The competency based data set contains 36 attributes representing e-CF
competencies, one attribute describing the study level and one attribute giving the
number of ECTS credit points.
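
The three label statistics defined above can be computed directly from the labelsets of a
data set. The following Python sketch uses a small invented data set, not the study course
data, and the function name label_statistics is an assumption made for illustration.

```python
def label_statistics(labelsets, num_labels):
    """Return (label cardinality, label density, number of distinct labelsets).

    labelsets: one collection of labels per example.
    num_labels: total number of labels (classes) defined for the data set.
    """
    n = len(labelsets)
    # Cardinality: average number of labels per example.
    cardinality = sum(len(labels) for labels in labelsets) / n
    # Density: cardinality normalised by the number of labels.
    density = cardinality / num_labels
    # Distinct labelsets: number of different label combinations that occur.
    distinct = len({frozenset(labels) for labels in labelsets})
    return cardinality, density, distinct


# Four toy examples over four possible labels.
toy_labelsets = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b"}]
print(label_statistics(toy_labelsets, num_labels=4))
```

For the toy data this gives a cardinality of 1.75, a density of 0.4375 and 3 distinct
labelsets.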


Table 1

Experimental settings for study course comparison

| Setting | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Input data set | Full study course descriptions (extracting word vectors in preprocessing) | Competencies of study course (e-CF), number of credit points, study level | Full study course descriptions (extracting word vectors in preprocessing) | Competencies of study course (e-CF), number of credit points, study level |
| Classification approach | Automatic classification | Automatic classification | Interactive classification (InClaS) | Interactive classification (InClaS) |
| Classification algorithms (methods) | 20 classification algorithm-method combinations (from Weka and Mulan) | 20 classification algorithm-method combinations (from Weka and Mulan) | 4 best methods from Stage 1 | 4 best methods from Stage 2 |
| Evaluation measures | Hamming loss, Micro-average precision, Micro-average recall, One-error, Coverage | Hamming loss, Micro-average precision, Micro-average recall, One-error, Coverage | M, PC, FU, TU | M, PC, FU, TU |



Table 2

Study course data set

| Data set | No. of attributes | No. of instances | No. of classes | Label density | Label cardinality | Distinct labelsets |
|---|---|---|---|---|---|---|
| Full data set (word vectors) | 1884 | 79 | 25 | 0.0620 | 1.6203 | 52 |
| Full data set (competencies) | 38 | 79 | 25 | 0.0620 | 1.6203 | 52 |
| Reduced data set (competencies) | 38 | 64 | 12 | 0.1341 | 1.6094 | 36 |

5.2. Main Experimental Results


Table 3 shows the results of random sub-sampling validation, repeated 3 times (in stage 3),
for the four methods which achieved the best results in stage 1 in terms of Hamming loss,
Micro-average precision, Micro-average recall, One-error and Coverage. BR stands for the
Binary Relevance method. The classification measures satisfy the following relations:

PC + Misclassified (without interactivity) = 1 (all classifications in an automatic manner).

PC + TU + FU + Misclassified (with interactivity) = 1 (all classifications in an interactive manner).

Misclassified (without interactivity) = TU + FU + Misclassified (with interactivity).
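
As a worked check, the three relations can be verified against the values reported in
Table 3. The Python sketch below copies those values; the function name relations_hold and
the rounding tolerance of 0.001 are assumptions made for illustration, since the table
reports three decimal places.

```python
import math

# (PC, TU, FU, Misclassified with interactivity, Misclassified without interactivity),
# copied from Table 3.
TABLE_3 = {
    "RAkEL(J48)":   (0.267, 0.333, 0.000, 0.400, 0.733),
    "BR(AdaBoost)": (0.100, 0.400, 0.000, 0.500, 0.900),
    "BR(Bagging)":  (0.067, 0.600, 0.000, 0.333, 0.933),
    "BR(JRip)":     (0.267, 0.367, 0.000, 0.366, 0.733),
}


def relations_hold(pc, tu, fu, m_inter, m_auto, tol=1e-3):
    """Check the three relations between the measures, up to rounding error."""
    return (
        math.isclose(pc + m_auto, 1.0, abs_tol=tol)                  # automatic manner
        and math.isclose(pc + tu + fu + m_inter, 1.0, abs_tol=tol)   # interactive manner
        and math.isclose(m_auto, tu + fu + m_inter, abs_tol=tol)     # decomposition of M
    )


for method, row in TABLE_3.items():
    print(method, relations_hold(*row))
```

All four rows satisfy the relations within the stated tolerance.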


Results in Table 3 should be interpreted as follows. With automatic classification, where
only partly or completely correct classifications (blue part of the table) and
misclassifications (red part of the table) exist, 27% of instances would be PC (in the case
of the RAkEL method) and 73% misclassified. If the interactive approach is used, the
number of PC instances remains the same; however, 33% of all instances, taken from those
previously misclassified, are marked by the classifier as uncertain and given to the
expert, reducing the share of misclassified instances to 40%. Table 3 shows that without
interactivity the number of misclassified instances is much higher for all methods. Note
the assumption that the expert correctly classifies the instances passed to them.

Table 4 presents the results of the stage 4 experiments.

Table 3

Interactive approach for direct study course comparison (word vectors)

| Method (algorithm) | Partly correct (PC) | True uncertain classification (TU) | False uncertain classification (FU) | Misclassified (with interactivity) | Misclassified (without interactivity) |
|---|---|---|---|---|---|
| RAkEL(J48) | 0.267 | 0.333 | 0.000 | 0.400 | 0.733 |
| BR(AdaBoost) | 0.100 | 0.400 | 0.000 | 0.500 | 0.900 |
| BR(Bagging) | 0.067 | 0.600 | 0.000 | 0.333 | 0.933 |
| BR(JRip) | 0.267 | 0.367 | 0.000 | 0.366 | 0.733 |


Table 4

Interactive approach for indirect study course comparison (competencies)

| Method (algorithm) | Partly correct (PC) | True uncertain classification (TU) | False uncertain classification (FU) | Misclassified (with interactivity) | Misclassified (without interactivity) |
|---|---|---|---|---|---|
| BR(NB) | 0.234 | 0.633 | 0.000 | 0.133 | 0.766 |
| BR(Bagging) | 0.167 | 0.733 | 0.000 | 0.100 | 0.833 |
| BR(AdaBoost) | 0.267 | 0.433 | 0.000 | 0.300 | 0.733 |
| BR(JRip) | 0.267 | 0.367 | 0.000 | 0.366 | 0.733 |

As with the stage 3 results, the ability of the InClaS classification system to track
uncertain classifications makes it possible to decrease the number of misclassified
instances, although the results vary considerably between the methods used. The graphical
representation of the JRip algorithm results in Fig. 7 emphasizes the impact of the
interactive approach even more. Without interactivity (Fig. 7 part A), all instances in the
red column of the table would be misclassified, leaving only 27% of PC. Such classification
results do not encourage the use of automatic classification in this problem domain. In
turn, the interactive approach (Fig. 7 part B), with its ability to handle uncertain
classifications, makes it possible to rescue half of the misclassified instances and assign
correct classifications to them after the expert’s review. Thus, 37% of instances are
misclassified, which is not a great result, but is much more promising than 73% with
automatic classification.

To all appearances, the given data set does not provide a complete concept description,
as was assumed when considering the domain features. To consider the situation in which
the number of training examples for each class has increased, experiments with the
reduced data set are carried out. The results lead to the conclusion that the interactive
classification system improves its results and disturbs the expert less frequently as the
training set grows over time. Therefore it is useful to spend more of the expert’s time in
the initial period of the classifier’s usage in order to obtain better classification
results later. Fig. 8 shows the difference between the results for the data set with a
reduced number of classes, where each class is described by a slightly higher number of
examples (part A), and the full data set, which includes many underrepresented classes
(part B). In the reduced data set, PC reaches 50% of instances, leaving 17% of instances
for the expert’s decision and also decreasing the number of misclassified instances. All
these parameters are improved in comparison to the initial data set.



[Figure 7. Legend: Partly or completely correct (PC); Uncertain (TU + FU); Misclassified (M).
Part A: PC 0.267, M 0.733. Part B: PC 0.267, Uncertain 0.367, M 0.366.]

Fig. 7. Test results of JRip algorithm with automatic (A) and interactive (B) classification.


[Figure 8. Legend entry: Partly or completely correct (PC). Surviving data labels: 0.167,
0.367, 0.267.]

Fig. 8. Test results of JRip algorithm task with reduced (A) and full (B) course data set.

Experimental results also refute the assumption that the indirect course comparison
provides better classification results than the direct comparison. That is, attributes
produced by extracting structured and meaningful information from course descriptions do
not surpass word vector based attributes built from the full course descriptions, in terms
of the number of misclassified instances and (partly) correct classifications. Both
approaches can be used; however, the indirect comparison currently requires much more
expert work in the attribute extraction phase, since competencies are not directly
accessible in course descriptions. Standardized course descriptions would make this
approach more convenient. A disadvantage of using word vectors to define attributes is
their low semantic meaning. They do not provide useful knowledge to the expert, as they
only describe occurrences of different words in descriptions regardless of where in the
text they appear, whether in preconditions or in learning outcomes. Therefore, knowledge
about the underlying commonalities of course content can be mined if meaningful
attributes are used, such as the competencies which the study course provides.

As an example of rules generated by the competencies-based classifier, Fig. 9 shows a
section of the JRip classifier, which is highly understandable for a human.

Each rule describes one study course based on the comparison data set available for
classifier training. For example, the classification model of the Riga Technical University
course Enterprise Architecture and Requirements Engineering says that if another course
provides the competency Solution Deployment (competency B.4 in e-CF), then the courses
are similar; otherwise they are not. The confidence of these rules is 73% (true for 8
instances, wrong for 3 in the training data set) and 90%, respectively.
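
The two confidence values can be reproduced with a short calculation. The Python sketch
below follows the reading of the bracketed counts given in the text above (first number
true, second number wrong); this reading is taken from the paper rather than from Weka’s
documentation, and the function name rule_confidence is illustrative.

```python
def rule_confidence(true_count, wrong_count):
    """Share of instances for which the rule is true, under the text's reading."""
    return true_count / (true_count + wrong_count)


# Counts shown in Fig. 9 for the Enterprise Architecture and Requirements
# Engineering model: (8.0/3.0) for the first rule and (61.0/7.0) for the default rule.
first_rule = rule_confidence(8, 3)
default_rule = rule_confidence(61, 7)
print(round(first_rule * 100), round(default_rule * 100))
```

Rounded to whole percentages, the two rules yield 73 and 90, matching the values stated
above.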

This type of representation provides the expert with easy-to-evaluate knowledge discovered
directly from historical or on-demand course comparisons between different educational
institutions. Corresponding study courses are gathered by examining all models; therefore
one course can receive more than one classification.


Model for Enterprise Architecture and Requirements Engineering JRip rules:

(B.4. Solution Deployment = 1) => EnterpriseArchitectureAndRequirementsEngineering=1 (8.0/3.0)
 => EnterpriseArchitectureAndRequirementsEngineering=0 (61.0/7.0)

Number of Rules: 2

Model for Quality Risk and Security Technologies JRip rules:

(E.3. Risk Management = 1) and (E.2. Project and Portfolio Management = 0) => QualityRiskAndSecurityTechnologies=1 (6.0/1.0)
 => QualityRiskAndSecurityTechnologies=0 (63.0/1.0)

Number of Rules: 2


Fig. 9. Excerpt of classification rules for study course comparison competencies-based data set.
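
How one course can receive several classifications when all models are examined can be
sketched in Python. The two rule functions below transcribe the rules of Fig. 9; the
dictionary representation of a course, the treatment of missing attributes as 0 and the
names MODELS and matching_courses are assumptions made for illustration, not part of the
prototype.

```python
def enterprise_architecture_rule(course):
    # (B.4. Solution Deployment = 1) => similar; otherwise not similar.
    return course.get("B.4. Solution Deployment", 0) == 1


def quality_risk_rule(course):
    # (E.3. Risk Management = 1) and (E.2. Project and Portfolio Management = 0)
    # => similar; otherwise not similar.
    return (
        course.get("E.3. Risk Management", 0) == 1
        and course.get("E.2. Project and Portfolio Management", 0) == 0
    )


# One binary model per reference course; every model is examined independently.
MODELS = {
    "EnterpriseArchitectureAndRequirementsEngineering": enterprise_architecture_rule,
    "QualityRiskAndSecurityTechnologies": quality_risk_rule,
}


def matching_courses(course):
    """Return the reference courses whose model classifies `course` as similar."""
    return [name for name, rule in MODELS.items() if rule(course)]


# Invented example: a course providing competencies B.4 and E.3 but not E.2.
example_course = {
    "B.4. Solution Deployment": 1,
    "E.3. Risk Management": 1,
    "E.2. Project and Portfolio Management": 0,
}
print(matching_courses(example_course))
```

The invented example course satisfies both models and is therefore matched to both
reference courses.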

6. Conclusions

The analysis of the existing situation regarding the automation of course description
correspondence detection has identified that this task is distinctive and does not fit
traditional automatic solutions because of the small available data set, semi-structured
data sources and multi-label class membership.

Having analysed computer supported educational document comparison and current
interactive classification approaches with regard to dealing with unclassified instances,
the authors of this paper suggest the InClaS framework. On its basis, algorithms, methods
and other components are defined that allow developing an interactive classification
system for decreasing the number of misclassified instances in domains where a human
expert is available.

A prototype of an interactive multi-label classification system has been developed and
adjusted for the study course comparison task. Course correspondences between the
Business Informatics master study programme at Riga Technical University and courses of
several corresponding study programmes in Europe are detected. The evaluation of InClaS
has shown the ability to decrease the number of misclassified instances significantly if
uncertain classifications are detected and passed to the expert for review.

The proposed application areas of the InClaS framework can, however, be broadened beyond
the educational domain. The following recommendations for InClaS application are drawn.

The use of the interactive classification system is feasible in areas where:

• A human expert is available who can classify individual instances.

• The problem domain is defined by attributes which are comprehensible for the expert
- not overwhelming in number and available in a human-interpretable form.

The interactive classification approach is more appropriate than automatic classification
in areas where at least one of the following statements holds:

• It is essential to receive a correct classification for as many instances as possible,
and it is acceptable to invest the expert’s work and time to achieve it.

• It is hard to extract or define domain features, resulting in attributes which do not
describe the underlying concept completely.

• Only a small initial learning set is available and it is suspected not to be
representative.

The theoretical and practical results provide opportunities for further research.
Some directions for future investigation are defining more sophisticated similarity
measures and considering other supervised and semi-supervised machine learning
approaches for the comparative analysis of university study courses, e.g., co-training and
case-based reasoning.

Application of an Interactive Classification System for Comparing University
Study Programmes

Ilze BIRZNIECE, Peteris RUDZAJS, Diana KALIBATIENE,
Olegas VASILECAS, Edgars RENCIS

As the amount of information grows, a need has arisen to classify it according to defined
criteria. This classification problem is also relevant in higher education, when searching
for similar study programmes and study modules, which would make it possible to
implement student exchange between universities and would facilitate the administration
of study modules. Currently, the comparison and administration of study modules is
manual work, which could be automated by introducing intelligent and adaptive systems.
Data in this problem domain are often unstructured and presented as text. This
complicates classification, and the existing algorithms for such tasks do not sufficiently
ease the work. The authors of the article propose a classification solution that allows the
classification process to be partly automated by involving not only domain experts but
also intelligent systems. Based on the proposed solution, a prototype implementing it has
been created and experiments have been carried out, which showed the effectiveness of
the proposed method.


